Pandas Data Processing (1)



Pandas Data Processing (1) - DataFrame and Series

Articles in the Pandas Data Processing series:

Pandas Data Processing (1) - DataFrame and Series (Mar 20, 2018)
Pandas Data Processing (2) - Filtering Data (Mar 21, 2018)
Pandas Data Processing (3) - Cheat Sheet, Chinese Edition (Jun 16, 2019)

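The examples in this article are based on pandas version 0.21.0.

DataFrame and Series are the two most commonly used data structures in pandas. A DataFrame can be thought of as a table in Excel, and a Series as one row or one column of that table.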
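To make the table analogy concrete, here is a minimal sketch showing that both a column and a row taken from a DataFrame come back as a Series. The sample data and the column names `price` and `quantity` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample table; names and values are illustrative only.
df = pd.DataFrame(
    {"price": [100, 200, 30], "quantity": [3, 3, 10]},
    index=["T001", "T002", "T003"],
)

column = df["price"]   # one column of the table
row = df.loc["T002"]   # one row of the table

print(type(column).__name__)  # Series
print(type(row).__name__)     # Series
print(list(row.index))        # a row is indexed by the table's column names
```

Note that the row Series is labeled by the column names, while the column Series is labeled by the row labels.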

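I. Series

A Series can be thought of as a one-dimensional array. What distinguishes it from a plain one-dimensional array is that a Series has an index.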
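One practical consequence of having an index is that arithmetic between two Series matches values by label rather than by position. The sketch below contrasts this with the underlying NumPy arrays; the Series `a` and `b` are made up for illustration.

```python
import pandas as pd

# Two hypothetical Series holding the same labels in a different order.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])

# Series arithmetic aligns on index labels: x -> 1 + 30, y -> 2 + 20, z -> 3 + 10.
by_label = a + b

# The raw arrays have no index, so they are added position by position.
by_position = a.values + b.values

print(by_label["x"], by_label["z"])  # 31 13
print(by_position.tolist())          # [11, 22, 33]
```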

1. Creating a Series

Default index

    import pandas as pd

    money_series = pd.Series([200, 300, 10, 5], name="money")  # with no index given, one is generated starting from 0
    """
    0    200
    1    300
    2     10
    3      5
    Name: money, dtype: int64
    """

    money_series[0]  # look up a value by its index label
    # 200

    money_series = money_series.sort_values()  # sort by value; each label stays attached to its value
    """
    3      5
    2     10
    0    200
    1    300
    Name: money, dtype: int64
    """

    money_series[0]  # label lookup: 0 still maps to 200, same as money_series.loc[0]
    # 200

    money_series.iloc[0]  # look up a value by position
    # 5

Custom index

    money_series = pd.Series([200, 300, 10, 5], index=['d', 'c', 'b', 'a'], name='money')
    """
    d    200
    c    300
    b     10
    a      5
    Name: money, dtype: int64
    """

    money_series.index  # inspect the index
    # Index(['d', 'c', 'b', 'a'], dtype='object')

    money_series['a']  # look up a value by its index label
    # 5

    money_series = money_series.sort_index()  # sort by index
    """
    a      5
    b     10
    c    300
    d    200
    Name: money, dtype: int64
    """

    money_series.iloc[-1]  # take the last value
    # 200

2. Slicing and selecting values

By index label

    money_series = pd.Series({'a': 200, 'b': 300, 'c': 10, 'd': 5}, name='money')
    """
    a    200
    b    300
    c     10
    d      5
    Name: money, dtype: int64
    """

    money_series.loc['a']  # same as money_series['a']
    # 200

    money_series.loc['c':'a':-1]  # from c back to a, in reverse order (both ends included)
    """
    c     10
    b    300
    a    200
    Name: money, dtype: int64
    """

    money_series.loc[['d', 'a']]  # values for d and a, same as money_series[['d', 'a']]
    """
    d      5
    a    200
    Name: money, dtype: int64
    """

By position

    money_series.iloc[0]
    # 200

    money_series.iloc[1:3]  # slice by position, end excluded; same as money_series[1:3]
    """
    b    300
    c     10
    Name: money, dtype: int64
    """

    money_series.iloc[[3, 0]]  # take the fourth value and the first value
    """
    d      5
    a    200
    Name: money, dtype: int64
    """

By condition

    money_series[money_series > 50]  # select values greater than 50
    """
    a    200
    b    300
    Name: money, dtype: int64
    """

    money_series[lambda x: x ** 2 > 50]  # select values whose square is greater than 50
    """
    a    200
    b    300
    c     10
    Name: money, dtype: int64
    """

II. DataFrame

The column names in the examples are kept in Chinese: 编号 (ID), 日期 (date), 单价 (unit price), 数量 (quantity), 运费 (shipping fee).

1. Creating a DataFrame

From a dict

    # The dict values must have equal length
    # Without an explicit index
    df = pd.DataFrame({'单价': [100, 200, 30], '数量': [3, 3, 10]})
    """
        单价  数量
    0  100   3
    1  200   3
    2   30  10
    """

    # With an explicit index
    df = pd.DataFrame({'单价': [100, 200, 30], '数量': [3, 3, 10]}, index=['T001', 'T002', 'T003'])
    """
           单价  数量
    T001  100   3
    T002  200   3
    T003   30  10
    """

From Series

    price_series = pd.Series([100, 200, 30], index=['T001', 'T002', 'T005'])
    quantity_series = pd.Series([3, 3, 10, 2], index=['T001', 'T002', 'T003', 'T004'])
    df = pd.DataFrame({'单价': price_series, '数量': quantity_series})
    # Where a Series has no value for a label, the cell is set to NaN
    """
             单价    数量
    T001  100.0   3.0
    T002  200.0   3.0
    T003    NaN  10.0
    T004    NaN   2.0
    T005   30.0   NaN
    """

From an Excel file, demo.xlsx

    df = pd.read_excel("path/demo.xlsx", sheet_name=0)
    # Select the sheet by name (销售记录 means "sales records")
    df = pd.read_excel("path/demo.xlsx", sheet_name='销售记录')

From a plain text file, demo.dat, with the following contents:

    编号|日期|单价|数量
    T001|2018-03-02 12:34:05|100|3
    T002|2018-03-02 13:04:05|200|3
    T003|2018-03-03 18:12:31|30|10
    T004|2018-03-04 20:34:05|400|2
    T005|2018-03-02 20:34:05|500|1

    df = pd.read_csv('demo.dat', delimiter='|')  # CSV is comma-separated by default; any other separator must be given via delimiter
    """
         编号                   日期   单价  数量
    0  T001  2018-03-02 12:34:05  100   3
    1  T002  2018-03-02 13:04:05  200   3
    2  T003  2018-03-03 18:12:31   30  10
    3  T004  2018-03-04 20:34:05  400   2
    4  T005  2018-03-02 20:34:05  500   1
    """

    df = pd.read_csv('demo.dat', delimiter='|', index_col='编号')  # index_col names the column to use as row labels
    """
                           日期   单价  数量
    编号
    T001  2018-03-02 12:34:05  100   3
    T002  2018-03-02 13:04:05  200   3
    T003  2018-03-03 18:12:31   30  10
    T004  2018-03-04 20:34:05  400   2
    T005  2018-03-02 20:34:05  500   1
    """

2. Selecting columns and rows

    df['日期']  # a single column -> returns a Series
    """
    编号
    T001    2018-03-02 12:34:05
    T002    2018-03-02 13:04:05
    T003    2018-03-03 18:12:31
    T004    2018-03-04 20:34:05
    T005    2018-03-02 20:34:05
    Name: 日期, dtype: object
    """

    df[['单价', '数量']]  # a list of columns -> returns a DataFrame
    """
           单价  数量
    编号
    T001  100   3
    T002  200   3
    T003   30  10
    T004  400   2
    T005  500   1
    """

    df.loc['T001']  # select a row by label, returns a Series
    df.iloc[0]      # select a row by position, returns a Series
    """
    日期    2018-03-02 12:34:05
    单价                    100
    数量                      3
    Name: T001, dtype: object
    """

    df.head(3)  # first three rows
    df.tail(3)  # last three rows (output shown below)
    """
                           日期   单价  数量
    编号
    T003  2018-03-03 18:12:31   30  10
    T004  2018-03-04 20:34:05  400   2
    T005  2018-03-02 20:34:05  500   1
    """

3. Modifying data

Multiply 单价 by 2

    df['单价'] *= 2
    # apply accepts a function, so it can handle more complex cases
    # Equivalent to: df['单价'] = df.apply(lambda x: x['单价'] * 2, axis=1)
    """
                           日期    单价  数量
    编号
    T001  2018-03-02 12:34:05   200   3
    T002  2018-03-02 13:04:05   400   3
    T003  2018-03-03 18:12:31    60  10
    T004  2018-03-04 20:34:05   800   2
    T005  2018-03-02 20:34:05  1000   1
    """

Add a prefix to 编号

    # 编号 is the index, so it is accessed through df.index
    df.index = '2018_' + df.index
    """
                                日期    单价  数量
    2018_T001  2018-03-02 12:34:05   200   3
    2018_T002  2018-03-02 13:04:05   400   3
    2018_T003  2018-03-03 18:12:31    60  10
    2018_T004  2018-03-04 20:34:05   800   2
    2018_T005  2018-03-02 20:34:05  1000   1
    """

Add 10 to 单价 for records whose 数量 is less than 3

    def change_price(x):
        # x is one row of the DataFrame (a Series)
        if x['数量'] < 3:
            return x['单价'] + 10
        return x['单价']

    df['单价'] = df.apply(change_price, axis=1)
    """
                                日期    单价  数量
    2018_T001  2018-03-02 12:34:05   200   3
    2018_T002  2018-03-02 13:04:05   400   3
    2018_T003  2018-03-03 18:12:31    60  10
    2018_T004  2018-03-04 20:34:05   810   2
    2018_T005  2018-03-02 20:34:05  1010   1
    """

Add a shipping fee (运费) column

    df['运费'] = pd.Series({'2018_T001': 10, '2018_T005': 12})
    """
                                日期    单价  数量    运费
    2018_T001  2018-03-02 12:34:05   200   3  10.0
    2018_T002  2018-03-02 13:04:05   400   3   NaN
    2018_T003  2018-03-03 18:12:31    60  10   NaN
    2018_T004  2018-03-04 20:34:05   810   2   NaN
    2018_T005  2018-03-02 20:34:05  1010   1  12.0
    """

    # Fill the missing values with 0; fillna returns a new DataFrame and leaves df unchanged
    df.fillna(0)
    """
                                日期    单价  数量    运费
    2018_T001  2018-03-02 12:34:05   200   3  10.0
    2018_T002  2018-03-02 13:04:05   400   3   0.0
    2018_T003  2018-03-03 18:12:31    60  10   0.0
    2018_T004  2018-03-04 20:34:05   810   2   0.0
    2018_T005  2018-03-02 20:34:05  1010   1  12.0
    """
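The conditional price change can also be written without a row-wise `apply`. The following is a minimal sketch, not the article's original method: it rebuilds a small DataFrame with the same 单价 (unit price) and 数量 (quantity) values as the state just before the `change_price` step, then updates the matching rows through a boolean mask and `.loc`. It also shows that `fillna` leaves the original DataFrame untouched unless the result is assigned.

```python
import pandas as pd

# Reconstructed sample data (the 日期 column is omitted for brevity).
df = pd.DataFrame(
    {"单价": [200, 400, 60, 800, 1000], "数量": [3, 3, 10, 2, 1]},
    index=["2018_T001", "2018_T002", "2018_T003", "2018_T004", "2018_T005"],
)

# Boolean mask: True for rows whose quantity is below 3.
mask = df["数量"] < 3

# Add 10 to the unit price on the masked rows only.
df.loc[mask, "单价"] += 10

# Assigning a Series aligns on the index; unmatched rows become NaN.
df["运费"] = pd.Series({"2018_T001": 10, "2018_T005": 12})

# fillna returns a new DataFrame; df itself keeps its NaN values.
filled = df.fillna(0)

print(df["单价"].tolist())      # [200, 400, 60, 810, 1010]
print(filled["运费"].tolist())  # [10.0, 0.0, 0.0, 0.0, 12.0]
```

The mask-based form operates on whole columns at once, which is usually faster than calling a Python function per row, while `apply` remains the more flexible option for logic that is hard to express column-wise.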
4. Deleting

Delete the 日期 (date) column (in place)

    del df['日期']
    """
                 单价  数量    运费
    2018_T001   200   3  10.0
    2018_T002   400   3   NaN
    2018_T003    60  10   NaN
    2018_T004   810   2   NaN
    2018_T005  1010   1  12.0
    """

Delete the 运费 (shipping fee) column (returns the filtered result)

    new_columns = list(df.columns)
    new_columns.remove('运费')
    df = df[new_columns]  # select only the remaining columns
    """
                 单价  数量
    2018_T001   200   3
    2018_T002   400   3
    2018_T003    60  10
    2018_T004   810   2
    2018_T005  1010   1
    """

Appendix: a recommended article for getting started with Python.
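As an alternative to the two column-deletion approaches in section 4, pandas also provides `DataFrame.drop`. This is a minimal sketch assuming pandas 0.21.0 or later, where the `columns` keyword is available; the two-row sample data is made up for illustration.

```python
import pandas as pd

# Hypothetical two-row sample with the same column names as the article.
df = pd.DataFrame(
    {"单价": [200, 400], "数量": [3, 3], "运费": [10.0, None]},
    index=["2018_T001", "2018_T002"],
)

# drop returns a new DataFrame without the named columns; df is unchanged.
trimmed = df.drop(columns=["运费"])

print(list(trimmed.columns))  # ['单价', '数量']
print(list(df.columns))       # ['单价', '数量', '运费']
```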

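Topic: Pandas Data Processing

This article was published on 2018-03-20 and last modified on 2023-11-15.

The permanent domain of this site is geektutu.com; you can also find me by searching for 「极客兔兔」.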
